First, get all depencies and packages necessary to complete all tasks.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))
BiocManager::install("GEOmetadb")
if (!requireNamespace("kableExtra", quietly = TRUE))
BiocManager::install("kableExtra")
if (!requireNamespace("gridExtra", quietly = TRUE))
install.packages("gridExtra")
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE))
BiocManager::install("org.Hs.eg.db")
if (!requireNamespace("knitr", quietly = TRUE))
install.packages("knitr")
if (!requireNamespace("biomaRt", quietly = TRUE))
BiocManager::install("biomaRt")
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("circlize", quietly = TRUE))
BiocManager::install("circlize")
if (!requireNamespace("limma", quietly = TRUE))
BiocManager::install("limma")
if (!requireNamespace("RCurl", quietly = TRUE))
install.packages("RCurl")
if (!requireNamespace("Biobase", quietly = TRUE))
install.packages("Biobase")
if (!requireNamespace("ggplot2", quietly = TRUE))
install.packages("ggplot2")
if (!requireNamespace("httr", quietly = TRUE))
install.packages("httr")
if (!requireNamespace("RJSONIO", quietly = TRUE))
install.packages("RJSONIO")
if (!requireNamespace("RCy3", quietly = TRUE))
BiocManager::install("RCy3")
library(RCy3)
library(httr)
library(RJSONIO)
library(ggplot2)
library(Biobase)
library(RCurl)
library(edgeR)
library(limma)
library(circlize)
library(ComplexHeatmap)
library(gridExtra)
library(knitr)
library(kableExtra)
library(GEOmetadb)
library(biomaRt)
library(org.Hs.eg.db)
The Expression Dataset I chose was GSE66306: Impact of bariatric surgery on RNA-seq gene expression profiles of peripheral monocytes in humans(Poitou et al. 2015).
Summary
Genome expression profiles were taken from obese women before, and three months after bariatric surgery.
Control and Test Conditions
The conditions that were tested were:
Before bariatric surgery (T0)
3 months after bariatric surgery (T3)
Why I was Interested
I have always been interested in health, and maintaining a healthy lifestyle; I was a competitive gymnast until I came to university. I always knew being healthy (whether that means eating healthily or being active / working out) had amazing benefits on your physical, mental, and emotional health. So a study that focused on health, and how a procedure like bariatric surgery can improve someone’s physical health, was of high interest to me.
Downloading the Dataset and initial previewing
Now that we have looked into information about the dataset, let’s download my dataset using BiocManager(Morgan 2019) and GEOmetadb(Zhu et al. 2008)!
sfiles = getGEOSuppFiles('GSE66306')
fnames = rownames(sfiles)
PM_exp = read.delim(fnames[1],header=TRUE,
check.names = FALSE, stringsAsFactors = FALSE)
Let’s first find out the dimensions of my dataset: 23354, 40
This indicates that there are 23354 rows and 40 columns!
Let’s take a quick look at what the first few rows and columns of my data looks like:
| Gene Name | Ensembl Gene ID | PM01_T0 | PM01_T3 | PM02_T0 | PM02_T3 |
|---|---|---|---|---|---|
| 1/2-SBSRNA4 | NA | 62.71930 | 62.29415 | 66.85663 | 41.28218 |
| A1BG | ENSG00000121410 | 97.01571 | 87.47226 | 67.36065 | 83.48856 |
| A1BG-AS1 | ENSG00000268895 | 63.30869 | 69.03770 | 62.58948 | 26.72385 |
| A1CF | ENSG00000148584 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| A2LD1 | NA | 174.01275 | 155.20349 | 71.78328 | 90.46922 |
A quick summary of what I observed about my dataset:
There are 23354 genes
There are gene names, Ensemble gene IDs, and 12 different test cases (two situations per patient, per gene)
Not all genes have Ensembl gene IDs
The gene names used are either the HUGO approved symbol or an alias symbol
Organize the dataset into patient IDs and cell types
Before doing further analysis of the dataset, I first want to create a table that lists all patients and easily displays the patient ID as well as the specific cell type analyzed.
| patients | time | |
|---|---|---|
| PM01_T0 | PM01 | T0 |
| PM01_T3 | PM01 | T3 |
| PM02_T0 | PM02 | T0 |
| PM02_T3 | PM02 | T3 |
| PM05_T0 | PM05 | T0 |
| PM05_T3 | PM05 | T3 |
| PM06_T0 | PM06 | T0 |
| PM06_T3 | PM06 | T3 |
| PM08_T0 | PM08 | T0 |
| PM08_T3 | PM08 | T3 |
Filter weakly expressed features from my dataset
Now, back to my dataset. I want to filter out weakly expressed features, using edgeR:
The filtered dimesions of the dataset now are: 13083, 40.
This means that 10271 genes were removed. That means there were 10271 outliers.
Edit the HUGO gene symbols and Ensembl Gene IDs
As mentioned above, some of the genes are missing Ensembl gene IDs. This is a large issue and I had lots of difficulty trying to salvage as many genes as I could that were missing the Ensembl gene IDs.
First, I tried to separate the genes that were missing ensembl gene IDs from the other genes:
na_gene_ids <- PM_exp_filtered[which(is.na(PM_exp_filtered$`Ensembl Gene ID`)), 1]
There are 1064 genes without ensembl gene ids!
I also read in the paper that they used hg19 instead of the most recent ensembl. Therefore, after some google searching, I came across this article that states that hg19 is equivalent to Ensembl’s GRCh37. As we were shown how to use Ensembl, I went with GRCh37 for all future queries.
Method 1: Match the gene names given in dataset to Ensembl IDs
In my dataset I was given gene ids - some of these were the same as HUGO symbols, while some were aliases or older symbols. I tried to use these to find the associated ensembl gene ids:
grch37 = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice") # From https://support.bioconductor.org/p/62064/
ensembl_grch37 = useDataset("hsapiens_gene_ensembl",mart=grch37)
is_na <- getBM(attributes = c("wikigene_name", "ensembl_gene_id"),
filters = c("wikigene_name"),
values = na_gene_ids,
mart = ensembl_grch37)
I was fortunate to find 317 of my ensembl ids. I put them back into the dataset by:
for (i in 1:nrow(is_na)) {
gene_name <- is_na[i,]$wikigene_name
ensembl_gene_id <- is_na[i,]$ensembl_gene_id
index <- which(PM_exp_filtered$`Gene Name` == gene_name)
PM_exp_filtered[index,2] <- ensembl_gene_id
}
still_na <- na_gene_ids[which(!na_gene_ids %in% is_na$wikigene_name)] #remove all now identified gene names
Now, I am missing 808 ensembl ids.
Method 2: Use entrez gene ids on genes that begin with LOC
In my dataset there are quite a few genes that begin with the letters LOC. Dr. Isserlin suggested that if the LOC is removed, these ids can be used as entrez gene ids! I then separated all gene ids that began with LOC and performed a query to use the numbers from the gene ids (that began with LOC) to find matching ensembl gene ids:
LOC_indexes <- grep("^LOC", still_na) # Find all gene names beginning with LOC
LOC_names <- still_na[LOC_indexes]
no_LOC_names <- gsub("^LOC", "", LOC_names) # Remove all of the LOC from every gene name beginning with LOC
length(no_LOC_names) #362
## [1] 362
LOC_grch37 <- getBM(attributes = c("entrezgene_id", "ensembl_gene_id"),
filters = c("entrezgene_id"),
values = no_LOC_names,
mart = ensembl_grch37)
I was able to find 245. Now I will put them back into my dataset by:
for (i in 1:nrow(LOC_grch37)) {
gene_name <- paste0("LOC", toString(LOC_grch37[i,]$entrezgene_id)) # Add back LOC that they will match with gene names
ensembl_gene_id <- is_na[i,]$ensembl_gene_id
index <- which(PM_exp_filtered$`Gene Name` == gene_name)
PM_exp_filtered[index,2] <- ensembl_gene_id
}
left_LOC_na <- no_LOC_names[which(!no_LOC_names %in% LOC_grch37$entrezgene_id)] # Find all gene names that start with LOC
left_na <- still_na[-LOC_indexes] # Remove all gene names that start with LOC from the <NA> list
I am still left with 446 to attempt to find the ensembl gene ids for. I removed all ids that began with LOC from the list of indices I have left to check as that was the only check that would work in finding ensembl gene ids for genes beginning with LOC.
Method 3: Use list of known aliases to match with dataset gene names
When I was trying to find a solution to my missing ensembl ids, I came across this website and decided to use this as well! I will try and find proper gene names that map to my dataset’s gene names, and use those to find ensembl gene ids.
# To get the list of gene names and aliases
dbCon <- org.Hs.eg_dbconn()
sqlQuery <- 'SELECT * FROM alias, gene_info WHERE alias._id == gene_info._id;'
aliasSymbol <- dbGetQuery(dbCon, sqlQuery)
m <- matrix(ncol=2, byrow=TRUE)
colnames(m) <- c('old_symbol', 'new_symbol') # Old symbol is our gene name, new symbol is matching gene name
all_new_symbols <- c()
for (val in left_na) {
if (val %in% aliasSymbol$alias_symbol) {
index <- which(aliasSymbol$alias_symbol == val)
proper_symbol <- aliasSymbol[index,]$symbol[1]
m <- rbind(m, c(val, proper_symbol)) #to form association b/w the two
all_new_symbols <- c(all_new_symbols, proper_symbol) #for next step, to match ensembl gene ids with
}
}
# Get the ensembl gene ids that map to the new gene names
ensembl_w_new_names <- getBM(attributes = c("wikigene_name", "ensembl_gene_id"),
filters = c("wikigene_name"),
values = all_new_symbols,
mart = ensembl_grch37)
# Now, put all of the ensembl gene IDs into the chart
for (i in 1:nrow(ensembl_w_new_names)) {
gene_name <- ensembl_w_new_names[i,]$wikigene_name # The new name we matched with our gene names
old_gene_name <- m[which(m[,2] == gene_name)] # Gene names in our dataset
ensembl_gene_id <- ensembl_w_new_names[i,]$ensembl_gene_id
index <- which(PM_exp_filtered$`Gene Name` == old_gene_name)
PM_exp_filtered[index,2] <- ensembl_gene_id
}
## Warning in PM_exp_filtered$`Gene Name` == old_gene_name: longer object length is
## not a multiple of shorter object length
This is the last method I could find. Even though there are still some genes that are missing ensembl ids, I will leave them in my dataset as they do have some form of identification, though the gene ids used may be aliases or older hugo symbols.
Finally, to actually find the HUGO symbols that map to all of thse ensembl gene ids and add them to the dataset:
# Find the HUGO symbols
all_HUGO <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
filters = c("ensembl_gene_id"),
values = PM_exp_filtered$`Ensembl Gene ID`,
mart = ensembl_grch37)
PM_exp_filtered$"HUGO_symbol" <- NA # Add HUGO column to dataset
# Put hugo symbols into the dataset
for (i in 1:nrow(all_HUGO)) {
ensembl_num <- all_HUGO[i,]$ensembl_gene_id
hugo_sym <- all_HUGO[i,]$hgnc_symbol
index <- which(PM_exp_filtered$`Ensembl Gene ID` == ensembl_num)
PM_exp_filtered$"HUGO_symbol"[index] <- hugo_sym
}
Now, it is time to check for duplicates!
PM_table <- data.frame(table(PM_exp_filtered$`Ensembl Gene ID`))
all_duplicates <- PM_exp_filtered[PM_exp_filtered$`Ensembl Gene ID` %in% PM_table$Var1[PM_table$Freq > 1],] #check which ensembl ids have a higher frequency than 1, meaning they are duplicated
length(all_duplicates$`Gene Name`) #476
## [1] 476
I can see that my dataset has 476 duplicates! I want to see which of my genes are duplicates:
gene_duplicates <- all_duplicates$`Gene Name`
all_duplicates$`Gene Name`
## [1] "ACBD6" "ACPL2" "AGAP5" "AGAP6"
## [5] "AGAP7" "AGAP8" "AGAP9" "ARMCX5-GPRASP2"
## [9] "ASB3" "ATP1A1OS" "AZI1" "BMS1P5"
## [13] "C10orf118" "C11orf35" "C11orf82" "C12orf23"
## [17] "C12orf44" "C17orf103" "C17orf72" "C19orf55"
## [21] "C19orf59" "C1QTNF6" "C1R" "C1orf115"
## [25] "C1orf162" "C1orf204" "C1orf216" "C1orf233"
## [29] "C1orf27" "C1orf85" "C20orf112" "C20orf201"
## [33] "C22orf26" "C2orf62" "C3orf17" "C3orf18"
## [37] "C3orf37" "C3orf38" "C3orf58" "C3orf62"
## [41] "C4orf21" "C5orf54" "C8orf37" "C8orf44"
## [45] "C8orf44-SGK3" "C8orf45" "C8orf46" "C8orf58"
## [49] "C8orf59" "C8orf76" "C8orf80" "C8orf82"
## [53] "CCDC144B" "CCDC144C" "CXXC1" "CXXC5"
## [57] "CXorf26" "CXorf36" "CXorf56" "CXorf65"
## [61] "CXorf69" "ECRP" "EIF4A2" "FAM105B"
## [65] "FAM211A" "FAM211B" "FAM21A" "FAM21B"
## [69] "FAM21C" "FLJ13197" "FLJ31813" "FLJ33630"
## [73] "FLJ41200" "FLJ45513" "GIMAP1-GIMAP5" "GIMAP5"
## [77] "GOLGA6L10" "GOLGA6L9" "GPR75-ASB3" "GTF2H2C"
## [81] "GTF2H2D" "HIST2H4A" "HIST2H4B" "HOXA9"
## [85] "IL8" "IPW" "ITFG2" "KGFLP2"
## [89] "KIAA0947" "KIAA1009" "KIAA1737" "KIAA1804"
## [93] "LIMS1" "LIMS3L" "LINC00341" "LOC100009676"
## [97] "LOC100128071" "LOC100128191" "LOC100128338" "LOC100128531"
## [101] "LOC100128682" "LOC100128788" "LOC100129196" "LOC100129250"
## [105] "LOC100129269" "LOC100129387" "LOC100129931" "LOC100129961"
## [109] "LOC100130557" "LOC100130581" "LOC100130691" "LOC100130855"
## [113] "LOC100131067" "LOC100131089" "LOC100131096" "LOC100131193"
## [117] "LOC100131434" "LOC100131691" "LOC100131733" "LOC100132077"
## [121] "LOC100132247" "LOC100132273" "LOC100132526" "LOC100132832"
## [125] "LOC100133445" "LOC100133612" "LOC100133991" "LOC100134229"
## [129] "LOC100134368" "LOC100134713" "LOC100190938" "LOC100233156"
## [133] "LOC100233209" "LOC100268168" "LOC100270804" "LOC100271836"
## [137] "LOC100272216" "LOC100272228" "LOC100286844" "LOC100287015"
## [141] "LOC100287042" "LOC100287177" "LOC100287225" "LOC100287314"
## [145] "LOC100287559" "LOC100287722" "LOC100287792" "LOC100288069"
## [149] "LOC100288123" "LOC100288198" "LOC100288432" "LOC100288637"
## [153] "LOC100288748" "LOC100288778" "LOC100288846" "LOC100289019"
## [157] "LOC100289473" "LOC100289561" "LOC100293534" "LOC100294362"
## [161] "LOC100302401" "LOC100306951" "LOC100329109" "LOC100335030"
## [165] "LOC100379224" "LOC100499177" "LOC100499484" "LOC100505495"
## [169] "LOC100505576" "LOC100505648" "LOC100505666" "LOC100505676"
## [173] "LOC100505678" "LOC100505702" "LOC100505715" "LOC100505761"
## [177] "LOC100505776" "LOC100505783" "LOC100505812" "LOC100505854"
## [181] "LOC100505865" "LOC100505989" "LOC100506046" "LOC100506054"
## [185] "LOC100506083" "LOC100506085" "LOC100506100" "LOC100506123"
## [189] "LOC100506124" "LOC100506497" "LOC100506688" "LOC100506710"
## [193] "LOC100506714" "LOC100506746" "LOC100506844" "LOC100506866"
## [197] "LOC100506930" "LOC100506963" "LOC100506994" "LOC100507032"
## [201] "LOC100507034" "LOC100507062" "LOC100507217" "LOC100507246"
## [205] "LOC100507373" "LOC100507392" "LOC100507404" "LOC100507423"
## [209] "LOC100507424" "LOC100507463" "LOC100507495" "LOC100507501"
## [213] "LOC100507557" "LOC100507567" "LOC100507577" "LOC100507632"
## [217] "LOC100509894" "LOC100527964" "LOC100616668" "LOC100630923"
## [221] "LOC100652999" "LOC100861402" "LOC144571" "LOC145783"
## [225] "LOC145820" "LOC147670" "LOC147727" "LOC147804"
## [229] "LOC148189" "LOC149837" "LOC150776" "LOC151174"
## [233] "LOC151475" "LOC151534" "LOC152217" "LOC152225"
## [237] "LOC153684" "LOC154761" "LOC155060" "LOC158257"
## [241] "LOC158572" "LOC200772" "LOC202181" "LOC202781"
## [245] "LOC219347" "LOC220729" "LOC220906" "LOC221442"
## [249] "LOC253039" "LOC254100" "LOC254128" "LOC255512"
## [253] "LOC257396" "LOC283089" "LOC283143" "LOC283299"
## [257] "LOC283624" "LOC283663" "LOC283693" "LOC283888"
## [261] "LOC284023" "LOC284385" "LOC284440" "LOC284454"
## [265] "LOC284551" "LOC284751" "LOC284837" "LOC284865"
## [269] "LOC284950" "LOC285033" "LOC285074" "LOC285103"
## [273] "LOC285359" "LOC285540" "LOC285696" "LOC285954"
## [277] "LOC285965" "LOC285972" "LOC286059" "LOC286186"
## [281] "LOC286467" "LOC338758" "LOC338799" "LOC338817"
## [285] "LOC339166" "LOC339290" "LOC339524" "LOC339803"
## [289] "LOC339874" "LOC340037" "LOC340515" "LOC340544"
## [293] "LOC349196" "LOC374443" "LOC375190" "LOC386597"
## [297] "LOC387646" "LOC387647" "LOC388387" "LOC389634"
## [301] "LOC389641" "LOC400027" "LOC400604" "LOC400657"
## [305] "LOC400891" "LOC400958" "LOC401074" "LOC401093"
## [309] "LOC401320" "LOC401397" "LOC401588" "LOC439994"
## [313] "LOC440173" "LOC440288" "LOC440300" "LOC440600"
## [317] "LOC440944" "LOC440970" "LOC441666" "LOC442028"
## [321] "LOC541471" "LOC541473" "LOC550112" "LOC550643"
## [325] "LOC595101" "LOC613037" "LOC619207" "LOC641298"
## [329] "LOC641367" "LOC642361" "LOC642852" "LOC643406"
## [333] "LOC643529" "LOC643623" "LOC643733" "LOC643770"
## [337] "LOC643837" "LOC644172" "LOC644242" "LOC644714"
## [341] "LOC645166" "LOC645212" "LOC645513" "LOC645676"
## [345] "LOC645752" "LOC646214" "LOC646329" "LOC646762"
## [349] "LOC646851" "LOC648740" "LOC648987" "LOC652276"
## [353] "LOC653061" "LOC653160" "LOC653501" "LOC653712"
## [357] "LOC678655" "LOC728024" "LOC728190" "LOC728431"
## [361] "LOC728537" "LOC728554" "LOC728558" "LOC728606"
## [365] "LOC728730" "LOC728752" "LOC728855" "LOC729678"
## [369] "LOC729737" "LOC729799" "LOC730091" "LOC730102"
## [373] "LOC730227" "LOC731424" "LOC731789" "LOC92249"
## [377] "LOC93622" "LOC96610" "LONP2" "LSMD1"
## [381] "MATR3" "MGC21881" "MGC57346" "MIR1248"
## [385] "MIR3653" "MRP63" "MST4" "NAA38"
## [389] "NBPF11" "NBPF16" "NBPF24" "NS3BP"
## [393] "OSTC" "OSTCP1" "PAR-SN" "PARPBP"
## [397] "PARS2" "PIDD" "PLSCR3" "PMS2P5"
## [401] "PRH1-PRR4" "PRKRIP1" "PROSC" "PROSER1"
## [405] "PSIP1" "PSKH1" "SGK3" "SNHG4"
## [409] "SPDYE2" "SPDYE5" "TARS" "TARS2"
## [413] "THADA" "THAP1" "TMED1" "TMED10"
## [417] "TMEM184A" "TMEM184B" "TMEM5" "TMEM50A"
## [421] "TMEM86A" "TMEM86B" "TNFAIP8L3" "TNFRSF10A"
## [425] "TREML4" "TRERF1" "TRIM8" "TRIM9"
## [429] "TSG101" "TSGA10" "TXN" "TXN2"
## [433] "UBE2O" "UBE2Q1" "UGCG" "UGDH"
## [437] "UQCR10" "UQCR11" "UQCRQ" "URB1"
## [441] "WASH2P" "WASH3P" "WDR7" "WDR70"
## [445] "ZFP14" "ZFP161" "ZHX2" "ZHX3"
## [449] "ZNF169" "ZNF17" "ZNF189" "ZNF19"
## [453] "ZNF195" "ZNF197" "ZNF253" "ZNF254"
## [457] "ZNF26" "ZNF260" "ZNF281" "ZNF436"
## [461] "ZNF438" "ZNF500" "ZNF501" "ZNF543"
## [465] "ZNF544" "ZNF671" "ZNF672" "ZNF674"
## [469] "ZNF675" "ZNF736" "ZNF737" "ZNF768"
## [473] "ZNF77" "ZNHIT1" "ZNHIT2" "ZNRF3"
If I remember correctly, the gene names listed here look an awful lot like the list of gene names who origianlly were missing ensemble gene ids! Let’s check:
orig_missing_ensembl <- length(which(gene_duplicates %in% PM_exp$`Gene Name`[is.na(PM_exp$`Ensembl Gene ID`)]))
length(which(gene_duplicates %in% PM_exp$`Gene Name`[is.na(PM_exp$`Ensembl Gene ID`)])) #406
## [1] 406
Wow! Most of the genes that are duplicates originally had no ensembl gene ids! As these duplicates make up around 3% of my dataset, I am going to leave all of these values in. I don’t feel comfortable removing genes, especially when I am unsure of the fact that the genes that are duplicated are being mapped 100% correctly.
Before perfoming any normalization on my dataset, I just wanted to be able to visualize my data.
I chose to use a boxplot because I found it to be the easiest representation to view the data as it showed distributions of each sample’s (PM_#) values and lots of information about them in one plot (interquartile range, first and third quartiles, and outliers).
I also used a denstiy plot as it is similar to a histogram, but you are able to easily view the distribution of data over a continuous interval of patient’s expression of the genes.
Now that I have been able to get an overview of what my data looks like, I will normalize the data:
filtered_data_matrix <- as.matrix(PM_exp_filtered[,3:40])
rownames(filtered_data_matrix) <- PM_exp_filtered$`Ensembl Gene ID`
d = DGEList(counts=filtered_data_matrix, group=samples$time)
d = calcNormFactors(d)
normalized_counts <- cpm(d)
A few of the normalized factors can be displayed:
| group | lib.size | norm.factors | |
|---|---|---|---|
| PM01_T0 | T0 | 33572965 | 0.8029624 |
| PM01_T3 | T3 | 32940023 | 0.9024528 |
| PM02_T0 | T0 | 28698158 | 1.0369125 |
| PM02_T3 | T3 | 28868780 | 1.0366107 |
| PM05_T0 | T0 | 24130290 | 0.9819000 |
We can see that there will be minor modifications to the dataset, but these modifications will still have a slight impact (as seen from the norm.factors column).
From this plot, I can automatically see that all patients, before and after surgery had very similar interquartile ranges, with a mean around 4. There seemed to be quite a few outliers, many on the more negative side, indicating much lower expression occurred slightly more frequently than very high expression.
The differences between pre-normalization and post-normalization are very minimal, especially in this graph. The different lines indiciating different patients are tigher together, however the mean has not shifted much. It seems that most pateints gene expression hovers around 0.18, with PM02_T3 dipping slightly lower at 0.15.
Now we have caught up with how the dataset was normalized and cleaned, and have a copy of it for further analysis!
kable(PM_exp_filtered[1:10,1:10], format = "html")
| Gene Name | Ensembl Gene ID | PM01_T0 | PM01_T3 | PM02_T0 | PM02_T3 | PM05_T0 | PM05_T3 | PM06_T0 | PM06_T3 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1/2-SBSRNA4 | ENSG00000247950 | 62.71930 | 62.29415 | 66.85663 | 41.28218 | 50.56757 | 22.50039 | 33.88546 | 25.70598 |
| 2 | A1BG | ENSG00000121410 | 97.01571 | 87.47226 | 67.36065 | 83.48856 | 186.39012 | 165.66641 | 130.99126 | 136.03277 |
| 3 | A1BG-AS1 | ENSG00000268895 | 63.30869 | 69.03770 | 62.58948 | 26.72385 | 118.38336 | 118.66857 | 64.01149 | 68.88014 |
| 5 | A2LD1 | ENSG00000134864 | 174.01275 | 155.20349 | 71.78328 | 90.46922 | 118.88036 | 117.50502 | 102.13343 | 111.10433 |
| 6 | A2M | ENSG00000175899 | 19.36841 | 16.66238 | 37.17075 | 18.27523 | 15.14377 | 37.73556 | 32.69034 | 26.84309 |
| 13 | AAAS | ENSG00000094914 | 595.53115 | 596.03090 | 494.07056 | 567.92197 | 476.07756 | 473.95374 | 488.67144 | 430.47829 |
| 14 | AACS | ENSG00000081760 | 174.93491 | 220.07247 | 213.37203 | 195.32551 | 218.75090 | 229.04913 | 174.27876 | 224.90844 |
| 21 | AAGAB | ENSG00000103591 | 932.06316 | 939.71988 | 994.14134 | 1006.52816 | 714.96474 | 747.77940 | 838.74404 | 723.04105 |
| 22 | AAK1 | ENSG00000115977 | 1318.43352 | 2106.26480 | 2545.63080 | 2495.16544 | 1405.69820 | 1428.93760 | 1780.92956 | 1493.12287 |
| 23 | AAMP | ENSG00000127837 | 2110.10878 | 2056.08450 | 1639.15665 | 1687.72253 | 1380.47480 | 1649.38669 | 1505.34146 | 1429.78255 |
Now I will begin analysis of gene expression from my dataset.
Before performing any analysis, I will be begin with plotting an MDS graph to analyze the culstering between patients before the surgery and three hours after the surgery.
# Make certain modifications to the dataset to remove issues of errors later in the report
#Only incude the genes whose ensembl gene ids are not NULL
PM_exp_filtered <- PM_exp_filtered[-which(is.na(PM_exp_filtered$`Ensembl Gene ID`)),]
#Get rid of all duplicates as well
PM_exp_filtered <- PM_exp_filtered[-which(PM_exp_filtered$`Gene Name` %in% all_duplicates$`Gene Name`),]
heatmap_matrix <- PM_exp_filtered[,4:ncol(PM_exp_filtered)-1]
rownames(heatmap_matrix) <- PM_exp_filtered$`Ensembl Gene ID`
colnames(heatmap_matrix) <- colnames(PM_exp_filtered[,4:ncol(PM_exp_filtered)-1])
plotMDS(heatmap_matrix, col = rep(c("darkgreen","blue"),10))
As shown in the graph, there is no two clusterings of patients before (T0) surgery and after (T3) surgery. It seems that most patients cluster just after 0 on the x-axis, for both results of before or after surgery. This indicates that my dataset is not the most ideal dataset, as in an ideal dataset the two treatment cases (T0 and T3) are in two very separate clusters. This may just add noise to my results later on in this process.
To have a further look at my dataset, I will plot an MDS graph and analyze the clustering between each individual patient before and after surgery.
pat_colors <- rainbow(10)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
plotMDS(heatmap_matrix, col = pat_colors )
As shown in the graph, there is a slight clustering within patients before (T0) and after surgery (T3). This is not ideal within a dataset, since patients should not cluster around eachother in the treatment and control cases (T0 and T3). Again, this indicates that noise will be present and may be an issue later on in the gene expression analysis.
Now, I will begin gene expression analysis of my dataset.
First, I will create a linear model using Limma(Ritchie et al. 2015).
model_design <- model.matrix(~ samples$time)
kable(model_design, type="html")
| (Intercept) | samples$timeT3 |
|---|---|
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
Next, I will fit my dataset to the linear model by applying empirical Bayes, which uses my data to specify the baseline, or set a prior observation. Then, I will create a table in which the p-values and adjusted p-values for gene expression are calculated.
expressionMatrix <- as.matrix(PM_exp_filtered[,3:40])
rownames(expressionMatrix) <- PM_exp_filtered$`Ensembl Gene ID`
colnames(expressionMatrix) <- colnames(PM_exp_filtered)[3:40]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
fit <- lmFit(minimalSet, model_design)
fit2 <- eBayes(fit,trend=TRUE)
topfit <- topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(PM_exp_filtered[,c(2,41)],
topfit,
by.y=0,by.x=1,
all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
#view hits
kable(output_hits[1:10,],type="html")
| Ensembl Gene ID | HUGO_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 10300 | ENSG00000196935 | SRGAP1 | 20.69417 | 45.12257 | 3.860254 | 0.0004222 | 0.9999188 | -4.595103 |
| 6417 | ENSG00000153283 | CD96 | 25.67070 | 51.88529 | 3.720099 | 0.0006360 | 0.9999188 | -4.595104 |
| 11960 | ENSG00000269821 | KCNQ1OT1 | 60.93707 | 186.82364 | 3.338608 | 0.0018822 | 0.9999188 | -4.595107 |
| 276 | ENSG00000020426 | MNAT1 | -20.75998 | 78.76815 | -3.305797 | 0.0020617 | 0.9999188 | -4.595107 |
| 11638 | ENSG00000248905 | FMN1 | 154.66557 | 441.55675 | 3.243284 | 0.0024497 | 0.9999188 | -4.595107 |
| 6728 | ENSG00000157933 | SKI | 302.04315 | 2099.74290 | 3.191329 | 0.0028240 | 0.9999188 | -4.595108 |
| 2544 | ENSG00000109790 | KLHL5 | -298.23446 | 2083.81718 | -3.188445 | 0.0028463 | 0.9999188 | -4.595108 |
| 318 | ENSG00000026652 | AGPAT4 | 218.70614 | 977.10470 | 3.143107 | 0.0032195 | 0.9999188 | -4.595108 |
| 8274 | ENSG00000170264 | FAM161A | 25.32429 | 59.14808 | 3.082983 | 0.0037863 | 0.9999188 | -4.595109 |
| 9815 | ENSG00000185862 | EVI2B | -4826.44290 | 22858.95871 | -3.058255 | 0.0040457 | 0.9999188 | -4.595109 |
Next, I want to see how many of my genes had significant enough expression before, and after adjustment.
#How many gene pass the threshold p-value < 0.05?
length(which(output_hits$P.Value < 0.05))
## [1] 223
#How many genes pass correction?
length(which(output_hits$adj.P.Val < 0.05))
## [1] 0
It seems that there are 223 genes that are significant before adjustment, but none after. This indicates that futher processing must be completed to get p-values that have been adjusted that are still significant.
Multiple Hypothesis Testing
A method that can be used to improve the results of the adjusted p-values is multiple hypothesis testing! This helps to control for patient variability, by taking both the patients’ gene expression AND the number of patients into account, instead of just the patients’ gene expression.
First, the linear model needs to be updated so that it now takes into consideration the patients’ gene expression and the number of patients.
model_design_pat <- model.matrix(~ samples$patients + samples$time)
model_design_pat[1:10,1:5]
## (Intercept) samples$patientsPM02 samples$patientsPM05 samples$patientsPM06
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 1 0 0
## 4 1 1 0 0
## 5 1 0 1 0
## 6 1 0 1 0
## 7 1 0 0 1
## 8 1 0 0 1
## 9 1 0 0 0
## 10 1 0 0 0
## samples$patientsPM08
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 1
## 10 1
Then, we again fit the model using empirical Bayes.
fit_pat <- lmFit(minimalSet, model_design_pat)
fit2_pat <- eBayes(fit_pat,trend=TRUE)
topfit_pat <- topTable(fit2_pat,
coef=ncol(model_design_pat),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits_pat <- merge(PM_exp_filtered[,c(2,41)],
topfit_pat,by.y=0,by.x=1,all.y=TRUE)
#sort by pvalue
output_hits_pat <- output_hits_pat[order(output_hits_pat$P.Value),]
kable(output_hits_pat[1:10,],type="html")
| Ensembl Gene ID | HUGO_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 6152 | ENSG00000149428 | HYOU1 | 224.10735 | 1584.97806 | 5.214442 | 0.0000409 | 0.2939791 | -4.531650 |
| 10178 | ENSG00000196218 | RYR1 | 18.61091 | 128.37796 | 4.784806 | 0.0001103 | 0.2939791 | -4.536787 |
| 4801 | ENSG00000135540 | NHSL1 | 26.21227 | 101.38153 | 4.722608 | 0.0001274 | 0.2939791 | -4.537575 |
| 8501 | ENSG00000172037 | LAMB2 | 74.16278 | 366.66467 | 4.718775 | 0.0001286 | 0.2939791 | -4.537624 |
| 8102 | ENSG00000168890 | TMEM150A | 107.76304 | 528.90158 | 4.701590 | 0.0001338 | 0.2939791 | -4.537844 |
| 11546 | ENSG00000242732 | RGAG4 | 66.27998 | 313.35280 | 4.667872 | 0.0001448 | 0.2939791 | -4.538278 |
| 11960 | ENSG00000269821 | KCNQ1OT1 | 60.93707 | 186.82364 | 4.608272 | 0.0001664 | 0.2939791 | -4.539054 |
| 276 | ENSG00000020426 | MNAT1 | -20.75998 | 78.76815 | -4.490611 | 0.0002191 | 0.3023822 | -4.540616 |
| 11254 | ENSG00000227460, ENSG00000197283 | NA | 46.40860 | 282.45028 | 4.488754 | 0.0002201 | 0.3023822 | -4.540641 |
| 3401 | ENSG00000119574 | ZBTB45 | -39.56290 | 135.70651 | -4.418770 | 0.0002593 | 0.3078740 | -4.541590 |
Next, we look at the p-values again like we did previously.
#How many gene pass the threshold p-value < 0.05?
length(which(output_hits_pat$P.Value < 0.05))
## [1] 933
#How many genes pass correction?
length(which(output_hits_pat$adj.P.Val < 0.05))
## [1] 0
Unfortunately, the results are again slightly disappointing in that there are still no genes with adjusted p-values that have significant gene expression.
When I was using Limma, I could not get any significantly differentially expressed genes, before or after multiple hypothesis testing.
I then re-read my paper and noticed that they also performed gene differential analysis, but used a different package than Limma, as the data seemed to follow a negative binomal distribution. Though Limma works well on data that follows a linear distribution, it should not be used by my data as my data follows a negative binomial distribution instead. EdgeR(MD, DJ, and GK 2010) is another package that can be used for differential expression, and is great for data that follows a negative binomial distribution - just like mine!
I will confirm that my data follows a negative binomial distribution by plotting the mean-variance graph of my data:
filtered_data_matrix <- as.matrix(PM_exp_filtered[,3:40])
rownames(filtered_data_matrix) <- PM_exp_filtered$`Ensembl Gene ID`
d = DGEList(counts=filtered_data_matrix, group=samples$time)
d <- estimateDisp(d, model_design_pat)
plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE,
NBline=TRUE,
show.binned.common.disp.vars = TRUE)
The blue line in the above graph is the negative binomial line. As you can see by the red X’s, my data follows the negative binomial distribution. Therefore, I can use the edgeR package on my dataset!
First, I will use the edgeR glmQLFTest to fit my dataset. I will fit my dataset using the multiple hypothesis test like I did with Limma, where I used both the number of patients and the pateints’ gene expression in an attempt to control for the patients’ variability.
#model_design_pat already is set to use both the number of patients and the patients' data
fit <- glmQLFit(d, model_design_pat)
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$timeT3')
kable(topTags(qlf.pos_vs_neg), type="html")
|
|
|
|
Next, I will check to see if there are any genes with significant gene expression using this new model.
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue", n = nrow(PM_exp_filtered))
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 2911
length(which(qlf_output_hits$table$FDR < 0.05)) #FDR is adjusted p-values
## [1] 138
Fortunately, there are 138 genes with significant gene expression, even after adjustment!
Next, I will plot the data on a heatmap(Gu, Eils, and Schlesner 2016), and look at the differences between patients before surgery (T0) and after surgery (T3).
top_hits <- rownames(qlf_output_hits$table)[qlf_output_hits$table$FDR<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
c(grep(colnames(heatmap_matrix_tophits),pattern = "T0"), grep(colnames(heatmap_matrix_tophits),pattern = "T3"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
current_heatmap
The heatmap indicates that there is no or very little clustering by conditions. The expression levels vary in both conditions to a large degree, though there seems to be slightly less gene expression before surgery (T0), indicated by a purple colour and there is slightly more gene expression after surgery (T3), indicated by a red colour.
Finally, I will use an MA plot to look at genes of interest in my dataset:
First, I will use a volcano plot to plot the upregulated genes; they can be seen as the orange circles on the graph.
plotVolcano <- cbind(qlf_output_hits$table$logFC, -log10(qlf_output_hits$table$FDR))
colnames(plotVolcano) <- c("logFC", "P-val")
upregulated <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC > 1
downregulated <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC < 0
point.col <- ifelse(upregulated, "orange", "black")
plot(plotVolcano, col = point.col)
Next, I will use a volcano plot to plot the downregulated genes; they can be seen as green circles on the graph.
point.col <- ifelse(downregulated, "green", "black")
plot(plotVolcano, col = point.col)
I am going to continue to use edgeR as it allowed me to find genes that passed correction. With edgeR, I will be finding the downregulated and upregulated genes from my dataset.
d <- calcNormFactors(d)
fit <- glmQLFit(d, model_design_pat)
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$timeT3')
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
n = nrow(filtered_data_matrix))
length(which(qlf_output_hits$table$PValue < 0.05 & qlf_output_hits$table$logFC > 0))
## [1] 732
length(which(qlf_output_hits$table$PValue < 0.05 & qlf_output_hits$table$logFC < 0))
## [1] 709
I was able to find that there are 732 upregulated genes, and 709 downregulated genes! Now I will find the ensembl gene IDs of these genes and save them in text files, which I will put into the g:profiler(Raudvere et al. 2019) webpage so that I can see what datasets match those genes best.
qlf_output_hits_withgn <- merge(PM_exp_filtered[,1:2],qlf_output_hits, by.x=2, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
#Get all of the gene sets
upregulated_genes <- qlf_output_hits_withgn$`Ensembl Gene ID`[which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$`Ensembl Gene ID`[which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC < 0)]
all_genes <- c(downregulated_genes, upregulated_genes)
#Save all gene sets to text files to be able to use in g:profiler
write.table(x=upregulated_genes,
file=file.path("data","PM_upregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file=file.path("data","PM_downregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=all_genes,
file=file.path("data","PM_all_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
Now that I have text files with all of the necessary genes, I will use these text files in g:profiler. The results can be found in images below.
The annotation data sets that I used were GO biological process, Reactome, and WikiPathways. There were no results for WikiPathways, so this annotation data set results will not be shown below. I used these annotation data sets because I am familiar with how they all work from previous assignments, as well as they all have important strengths that can be leveraged. Reactome is useful in identifying full molecular details of a pathway, while GO is useful to learn more about the outcome or ending state of molecular functions. I am using version e98_eg45_p14_ce5b097 of the annotation data sets.
I used the same thresholds from when I performed gene expression analysis; again I will be looking at FDR (Benjamini-Hochberg method) that are below 0.05, as this is the usual threshold that is set for other quantifiers such as p-values.
The amount of genesets returned from each annotation data set for each query performed are shown below.
Up-Regulation Results
Go biological pathways: 23 genesets
Reactome: 53 genesets
Down-Regulation Results
GO biological pathways: 91 genesets
Reactome: 3 genesets
Combined Up-Regulation and Down-Regulation Results
GO biological pathways: 101 genesets
Reactome: 14 genesets
Results for up-regulation are as follows:
The top term genesets returned have to do with response to heat and transporting mRNA.
Results for down-regulation are as follows:
The top term genesets returned have to do with building and breaking down cells, and actions of blood cells (oxygenating, deoxygenating, degranulation).
Results for both down-regulation and up-regulation are as follows:
The top term genesets returned have to do with SUMOylation of different proteins, and metabolic and catabolic processes.
Comparing GO results from all
It seems that all the results for GO invovled some form of regulation or organization of different parts or aspects of the cell. It seems that for the results where both down-regulation and up-regulation were included, the top term genesets returned were a combination of the results from down-regulation and up-regulation, as to be expected when combining the genes together.
Comparing Reactome results from all
It seems that the results for Reactome were again, a combination of both genesets returned from down-regulation and up-regulation. However, as opposed to GO, instead of having an almost equal spread of top term genesets between down and up-regulation, most of the top term genesets were from the up-regulation rather than the down-regulation.
In the paper I chose, they also performed gene expression analysis and thersholding, so I could compare my results with theirs.
The paper specifies that the upregulated genes were “mostly involved in the basal cellular machinery.” Basal cellular machinery has to do with forming the basic building blocks of the cell. This aligns with my results, as my results have to do with transporting mRNA, which is one of the basic building blocks to create cells.
The paper specifies that the down-regulated genes were “enriched in metabolic functions of adipose tissue”. This corresponds well with my results, since the results I found were genesets that broke and created cells, which are metabolic functions.
There are other papers who perform research to observe the effects after bariatric surgery on another’s body.
Another article(Santos et al. 2014) I found performed research on different aspects of patients’ physical and cellular heath before and three months after bariastric surgery. The results that they found were that the patients’ weight went down, and there were decreased neutrophil and triglyceride counts. The results that I found result were similar to the results of this paper as the genesets that were downregulated were actions of blood cells (like nutrophils), as well as breaking down cells like triglycerides. Also, my results indicate that response to heat is up-regulated, leading me to believe that the increased response to heat causes the patients to lose weight, just as the paper described.
There was also an article(Sparks et al. 2015) that reported findings on arthritic patients’ health before and after bariatric surgery. The results were similar to the article above, where the patients again lost weight, but in addition, in this article it was documented that there was a lower erythrocyte sedimentation rate. This again matches with my results, as a down-regulated gene was oxygentation and deoxygenation of blood cells, which means that erythrocyte sedimentation rate will decrease as a result since the erythrocytes ability to fall quickly.
Now I have gotten gene names that match the genes in the dataset, normalized my original dataset and gotten the ranks of all the genes in teh dataset. I will preform further enrichment and pathway analysis below on my dataset.
Before preforming GSEA(Subramanian et al. 2007), we need the rank (.rnk) file and the .gmt file.
To get the ranked file:
qlf_output_hits_withgn = qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank, decreasing=TRUE),] # get ranks from table
rnk_file_info = qlf_output_hits_withgn[c("Gene Name", "rank")] # create a table with just the gene name and associated rank
write.table(x=rnk_file_info,
file=file.path("data","A3_rnk_file.rnk"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE) # save the table to a .rnk file
To get the (February 1, 2020 version from the Bader lab) .gmt file:
gmt_url = "http://download.baderlab.org/EM_Genesets/February_01_2020/Human/symbol/" # use the Feb 1, 202o .gmt file
# list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# get the gmt that has all the pathways and does not include terms inferred from
# electronic annotations(IEA) start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)", contents,
perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path("~/Desktop/Year3/BCB420/ABC-units", gmt_file)
download.file(paste(gmt_url, gmt_file, sep = ""), destfile = dest_gmt_file) # save the file
Click on Load Data, and load both the .rnk and .gmt files that were listed above by clicking “browse for files”, selecting the two files and then clicking “Load these files!”
Go to the “Run GSEAPreranked” tab and select as the “Gene sets database” the .gmt file you imported
Use the .rnk file as the rank file
In the basic fields, select the min size to be 15, and the max size to be 200
Click run at the bottom, and wait for the status on the left bar to change from Running to Success
When the status bar shows Success, click on it and a .html file should open in your browser!
To get more details about your results, in the na_pos and na_neg, you can select “Detailed enrichment results in html format” and explore there
Upregulated Results:
Some results from the positive top term are:
Name: OSSIFICATION%GOBP%GO:0001503
P-value: 0.000
ES: 0.59
NES: 0.213
FDR: 0.011
Number of Genes in Leading Edge: 19
Top Gene Associated with this Gene Set: EXT1
Downregulated Results:
Some results from the negative top term are:
Name: FORMATION OF A POOL OF FREE 40S SUBUNITS%REACTOME DATABASE ID RELEASE 71%72689
P-value: 0.000
ES: -0.64
NES: -2.48
FDR: 0.000
Positive / Upregulated Results
A straightforward comparsion and commonality could be made between the REACTOME results of the thresholded method and the non-thresholded method. The common theme illustrated in both method’s results were building of necessary bodily materials. In the REACTOME results, common term names invovled transport of mature mRNA and SUMOylation. Both of these things are involved in the process of creating more DNA and proteins for the body. Transporting mRNA can lead to proteins being created, and SUMOylation is an aspect of post-translational modifications of proteins. Both of these aspects are important for creating finalized products of proteins. From the non-thresholded method, common results involved aspects of bone development and DNA development. The aspects of bone development that were results from the non-thresholded method were ossification, osteoblast differentiation and bone development. The various forms of DNA development that were results from the non-thresholded method were peripheral nervous system development, miRNA biogenesis and eye development. All of the different forms of development and growth as results from the thresholded and non-thresholded methods illustrate an upregulation in creation and building of various necessary bodily features. Though on different scales (REACTOME was more micro and the non-thresholded method involved more overall organs / systems) both methods involved upregulation of various materials needed to create and sustain bodily materials.
The GO threhsolded results for the upregulated results seemed to contradict the non-thresholded results, as the GO thresholded results focused on catabolic processes, which is when something is breaking down. This is the opposite of the non-thresholded results, who focused on building things instead of destroying them. Therefore the GO thresholded results did not have a straightforward similarity with the non-thresholded upregulated results but instead a straightforward discrepancy.
Negative / Downregulated Results
At first when I compared the results of the thresholded and non-thresholded experiments, nothing really jumped out at me. But once I took a step back and looked at the overall themes that could be seen from both experiments, it seemed that there actually was a straightforward commonality. In the thresholded results, an overall theme that appeared was inflammation and infection. This was evident from the term names neutrophil degranulation, leukocyte degranulation and other names of cells activated during an immune response. As these were all aspects that took part in inflammation and infection and were downregulated, it lead me to believe that inflammation and infection were generally downregulated in the patients after they had bariatic surgery. This aligns with results from the non-thresholded method, like the results of downregulation of viral mRNA translation and silencing of ceruloplasmin expression. Downregulation of viral mRNA translation indicates to me that there is less viruses occurring in the patients and therefore less of a possibility and necessity for inflammation and infection to occur. In addition, I researched what would be a reason for too much ceruloplasmin expression (which is the opposite of our results). The results from this research were that too much ceruloplasmin indicated that a serious infection was occuring in the body. As ceruloplasmin expression was downregulated, then it is likely that there was less occurrences of serious infections. This is the same overarching theme as the thresholded experiment. Therefore, infection and inflammation (resulting from infection) were able to be seen and compared to be downregulated in both the thresholded and non-thresholded results.
Now that I have my GSEA results, I can use these to create an enrichment map on Cytoscape(Shannon et al. 2003).
Apps > Enrichment Map
Click on the file folder and import the results from the GSEA method
Check the box next to “Show Advanced Options” to be able to select the p-value cutoff and the FDR q-value cutoff
Click the “Build” button
I first tried to create an enrichment map using the parameters of a p-value of 0.05 and an FDR value of 0.05. However, whenever I did this it only displayed mostly downregulated results (as demonstrated in blue):
So I decided to edit my parameters slightly so I would have more results for upregulated genes. To be able to have this occur, I adjusted my parameters to have a FDR value of 0.25 and the same p-value as before of 0.05. When I did this, I was able to have more results for both upregulation (red) and downregulation (blue):
In the final network that I created, I clicked on the “show advanced options” to be able to select the p-value of 0.05 and the FDR value of 0.25.
In this network, there are 112 nodes and 516 edges.
Apps > AutoAnnotate > New Annotations Set
On the Quick Start page, I left the “Annotate entire network” selected
On the Advanced page, I selected the button “Create singleton clusters” (since I had lots of clusters with a single gene inside)
Select “create annotation”
I used the resulting network from the previous steps to create my publication ready figure, by adding a figure legend:
The various themes can be analyzed by looking at the bolded words that surround various groups of genes (ex: complex cytoplasmic translation)
As there are lots of singleton clusters in my network, there are lots of different themes present in my analysis. These themes being shown are very similar to my results from the thresholded and non-thresholded methods. Themes like ossification, ear development, peripheral nervous system development are upregulated in the network, and are seen as upregulated in the non-thresholded method results. Themes like heme metabolism and antimicrobial humoral response are downregulated in the network and align with the thresholded and non-thresholded theme results as well.
Another theme that was seen in the network (but not really pointed out in the thresholded and non-thresholded methods) was pid insulin pathway. This theme was downregulated, which makes logical sense to me as the paper analyzed patients before and after bariatic surgery. The overall purpose of bariatic surgery is to lose weight by making the patients stomachs smaller, making the patients eat and drink less. If you are eating less, then you require less insulin. Insulin is a hormone that aids in signalling for the food eaten to be transformed into energy and storage. Since there is less food being eaten, then there is less food that can be transformed into energy and storage, and therefore a lower requirement of insulin.
File > Import > Network from Public Databases…
In “Data Source” select WikiPathways
Search WIP179 (WikiPathway code for folate metabolism)
Click on the folalte metabolism column
Click on “Import as Network”
Now, to be able to see the colour gradient (blue, white, red) based on the rank of genes in this pathway:
File > Import > Table from File
Select the .rnk file from earlier
In the control panel, move to Style
Select Fill Colour, and then select the name of your .rnk file
I chose this downregulated pathway as it matched with some of the large themes that were downregulated in the network. Folate metabolism relates to metabolism, oxidation and inflammation. Inflammation is shown as a downregulated theme in the non-thresholded method used as well as it relates to the theme of infection illustrated in the thresholded downregulated results. Furthermore, insulin (which relates to metabolism) was shown to be a downregulated theme in the network as well which was not as clearly shown in the thresholded / non-thresholded methods, so it would be interesting to use a pathway that incorporates that as well.
I imported my rank file into the folate metabolism pathway to be able to see how the different genes in the pathway had been affected in the patients after the bariatric surgery. There were various downregulated (blue) and upregulated (red) genes shown in the pathway. I found information about as many of the downregulated and upregulated genes in the pathway as I could, and it seems that the results of the upregulaltion and downreulgation of these genes match with the results found from other methods I have performed (thresholded and non-thresholded methods)!
The most obviously downregulated (darkest blue node) gene seen in the pathway was HBA1. When looking at the pathway, it can be noticed that HBA1 originates from glucose. Glucose comes from the breakdown of food, which occurs when the hormone insulin is released. It was stated earlier in the enrichment map results that the pid insulin pathway was downregulated; meaning less insulin was being released, and therefore less glucose was being made. If less glucose is made, then less HBA1 would be produced as HBA1 is produced from glucose. Therefore the downregulation of HBA1 aligns with our previous enrichment map results.
HBB is another downregulated gene that can be seen in the pathway. It is located in an almost identical position pathway-wise to that of HBA1, (in which it originates from glucose), and therefore also makes logical sense to be downregulated just as HBA1 is.
MTHFS(“MTHFS Gene (Protein Coding) Methenyltetrahydrofolate Synthetase” 2020) was seen to be a lighter shade of blue, nonetheless showing to be downregulated. After some research, MTHFS is a gene that supplies carbon for the biosynthesis of purines, thymine and amino acids. Carbons originate from glucose. As previously mentioned, glucose is downregulated due to the lack of food entering the patient’s body after the surgery since less insulin is being used for signalling, so less food is taken in to be broken down and used. Therefore with the lack of insulin comes the lack of glucose, and there will therefore be a deficit in carbon molecules. If there is a deficit of carbon molecules, the MTHFS gene has a lower ability to perform its task of supplying carbon to make amino acids and other biological products. Therefore the downregulation of MTHFS aligns with the results of the downregulation of the pid insulin pathway from the enrichment map.
The gene MPO(Medicine 2020b) was downregulated as well. MPO is a gene that is most abundantly expressed in neutrophil granulocytes and functions to produce hypohalous acids which carry out antimicrobial activity. This means that MPO is a gene that is involved with immune response to infections or other illnesses. The downregulation of MPO aligns with the results of downregulation of inflammation and infection as seen from the thresholded and non-thresholded methods. If MPO is downregulated, then there is less hypohalous acids which means there is less antimicrobial activity. The lack of antimicrobial activity leads me to believe that there is no need for it, meaning that there is less infections and less inflammation, as seen by the thresholded and non-thresholded results.
Though this pathway was shown to be downregulated, there were some upregulated genes shown in it. This suprised me at first, however after some research it became more clear to me how the upregulation of those genes allowed the overall pathway to be downregulated and align with previous results.
IL10(“IL10 Gene (Protein Coding) Interleukin 10” 2020) was one of the genes that were upregulated. IL10 is a gene that functions to downregulate the expression of substances secreted from cells of the immune system. Therefore, if IL10 is upregulated, then more secreted substances of immune system cells are downregulated. Our previous results indicated that the immune result was downregulated, which is exactly what the upregulation of IL10 is doing! Therefore, the upregulation of IL10 matches our previous results.
LDLR(Medicine 2020a) is another upregulated gene. LDLR stands for low-density lipoprotein receptor. It functions to collect low-density lipoproteins (LDL) from the bloodstream and break them down into cholesterol to either be used or excreted from the body. LDL is a very harmful form of lipoprotein, and is usually referred to as the “bad” cholesterol as a high quantity of LDL can lead to an unwanted buildup of cholesterol in the arteries. Therefore, the upregulation of LDLR allows your body to be more “healthy”, as it is breaking down and reducing the amount of LDL in your body. Therefore, the more low-density lipoprotein RECEPTORS in your body, the less low-density lipoprotein in your body! In my opinion, though this finding does not exactly align with any of the previous results found, it does align with the overarching analysis and point of the paper. The point of the paper was to analyze patients before and after bariatric surgery. Bariatric surgery is used in an attempt to make people healthier and less overweight. Having less LDL in your body makes you healthier, which aligns with the overall results of the paper and the purpose of the surgery to be performed.
As explained in various earlier sections, there are lots of commonalities between the results of the thresholded method as well as the non-thresholded method. As the thresholded results were already described (in A2) to be very similar to that of the paper, the same is the case for the non-thresholded method! Downregulated trends explained in the paper involved metabolic functions of adipose tissue; more specifically, immune responses of adipose tissue, antiviral factor and neutrophil-mediated inflammation. Downregulation in inflammation can be seen very explicitly in the results and analysis of the folate metabolism pathway. In addition, the downregulation of antiviral factor is very similar to the downregulation of antimicrobial humoral response, as they are both used to prevent harmful organisms or viruses from entering and harming the body. The downregulation of both also align with the results of decreased inflammation, since decreased need for protection against harmful substances would mean that there are less harmful substances in the body and therefore the body requires less of an inflammatory response.
Also addressed in the paper was after the bariatric surgery, there was a decrease in the patients having insulin resistance; meaning their body was able to still detect when insulin was being secreted and would respond to that hormone. This is commonly due to the body not releasing as much insulin. This aligns directly with the theme of downregulation of the pid insulin pathway; meaning there was less insulin being secreted, meaning less of a chance of patients developing insulin resistance (if insulin is not there, how would one develop resistance to it?).
As the downregulation of inflammation and neutrophil production from the non-thresholded method were similar to the thresholded method, these results also align with the paper(Santos et al. 2014) previously discussed (in A2). In this paper, it was explained that neutrophil production was decreased in patients after the bariatric surgery. This was identical to the results in the non-thresholded method, where neutrophil products were also decreased (explained in more detail in the folate metabolism pathway analysis). In addition, this paper mentioned that there was decreased quantities of triglycerides. This is similar to the non-thresholded results, as the pid insulin pathway was also downregulated. If the insulin pathway was reduced, then there is less of a hormone signal indicating that food taken in needs to be used or stored; meaning less food was convered into triglycerides. Therefore a downregulation of insulin equates to a downregulation of triglycerides. Therefore, the paper my data is based off of as well as this other paper provide similar results.
Another paper(Graessler et al. 2013) found similar downregulation results as my non-thresholded method results. Again, the paper indicated downregulation in inflammatory activity, as well as a downregulation in glucose activity (they described it as glucose stabilization in the patients - which looking more into the paper meant that the amount of glucose and its activity decreased). Inflammation and glucose activity were also decreased in my results, as earlier explained by the downregulation in antimicrobial humoral response, the downregulation of the folate metabolism pathway, and the downregulation of the pid insulin pathway.
Graessler, J, Y Qin, H Zhong, J Zhang, Julio Licinio, Ma-Li Wong, A Xu, et al. 2013. “Metagenomic Sequencing of the Human Gut Microbiome Before and After Bariatric Surgery in Obese Patients with Type 2 Diabetes: Correlation with Inflammatory and Metabolic Parameters.” The Pharmacogenomics Journal 13 (6). Nature Publishing Group: 514–22.
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.
“IL10 Gene (Protein Coding) Interleukin 10.” 2020. GeneCards the Human Gene Database.
MD, Robinson, McCarthy DJ, and Smyth GK. 2010. “edgeR a Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (7): 139–40.
Medicine, U.S. National Library of. 2020a. “LDLR Gene, Low Density Lipoprotein Receptor.” Genetics Home Reference.
———. 2020b. “MPO Gene, Myeloperoxidase.” Genetics Home Reference.
Morgan, Martin. 2019. BiocManager: Access the Bioconductor Project Package Repository. https://CRAN.R-project.org/package=BiocManager.
“MTHFS Gene (Protein Coding) Methenyltetrahydrofolate Synthetase.” 2020. GeneCards the Human Gene Database.
Poitou, Christine, Claire Perret, François Mathieu, Vinh Truong, Yuna Blum, Hervé Durand, Rohia Alili, et al. 2015. “Bariatric Surgery Induces Disruption in Inflammatory Signaling Pathways Mediated by Immune Cells in Adipose Tissue: A Rna-Seq Study.” PLoS One 10 (5). Public Library of Science.
Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “g:Profiler a Web Server for Functional Enrichment Analysis and Conversions of Gene Lists (2019 Update.” Nucleic Acids Research. https://doi.org/10.1093/nar/gkz369.
Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Santos, J, P Salgado, C Santos, P Mendes, J Saavedra, P Baldaque, L Monteiro, and E Costa. 2014. “Effect of Bariatric Surgery on Weight Loss, Inflammation, Iron Metabolism, and Lipid Profile.” Scandinavian Journal of Surgery 103 (1). SAGE Publications Sage UK: London, England: 21–25.
Shannon, P, A Markiel, O Ozier, N S Baliga, J T Wang, D Ramage, N Amin, B Schwikowski, and T Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13 (11): 2498–2504. https://doi.org/10.1101/gr.1239303.
Sparks, Jeffrey A, Florencia Halperin, Jonathan C Karlson, Elizabeth W Karlson, and Bonnie L Bermas. 2015. “Impact of Bariatric Surgery on Patients with Rheumatoid Arthritis.” Arthritis Care & Research 67 (12). Wiley Online Library: 1619–26.
Subramanian, Aravind, Heidi Kuehn, Joshua Gould, Pablo Tamayo, and Jill P Mesirov. 2007. “GSEA-P: A Desktop Application for Gene Set Enrichment Analysis.” Bioinformatics 23 (23). Oxford University Press: 3251–3.
Zhu, Yuelin, Sean Davis, Robert Stephens, Paul S. Meltzer, and Yidong Chen. 2008. “GEOmetadb: Powerful Alternative Search Engine for the Gene Expression Omnibus.” Bioinformatics (Oxford, England) 24 (23): 2798–2800. https://doi.org/10.1093/bioinformatics/btn520.